Red Wine Quality

Lab Assignment Three: Extending Logistic Regression


Richmond Aisabor

Business Understanding

Wine is an alcoholic drink typically made from fermented grape juice. Different varieties of yeasts and grapes produce different styles of wine. Vinho Verde refers to wine originating from northern Portugal. Many countries have legal appellations that restrict the geographical origin, permitted grape varieties, and other aspects of wine production. Vinho Verde is not a grape variety but a protected designation of origin for wine production. Vinho Verde comes in red, white, and rosé styles; however, this analysis will focus on red Vinho Verde wine. The dataset contains a set of red Vinho Verde wine samples, and the goal is to classify the wine samples by quality based on physicochemical features.

The classification model for red wine quality could be of interest to those seeking to create the perfect wine. This dataset includes 11 features, but some of these features may not be relevant to determining quality. Through feature selection, it becomes clear what exactly a winemaker should focus on when cultivating grapes to produce wine. Since this dataset presents a unique opportunity for the research and development side of the wine industry, this model is most useful for offline analysis.

Dataset source: https://archive.ics.uci.edu/ml/datasets/Wine+Quality

Data Understanding

Data Description

Based on the dataframe information, no missing values are found in the dataset: each feature has 1599 entries (1599 rows in the data). If there were missing data, I could impute the values with the median (or mode) since all of the features are numerical.
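As a sketch of what that imputation could look like (using a small hypothetical frame with an artificial gap rather than the real dataset, which has none):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing fixed acidity value.
df = pd.DataFrame({
    "fixed acidity": [7.4, 7.8, np.nan, 11.2],
    "alcohol": [9.4, 9.8, 9.8, 9.8],
})

# Median imputation for a numerical feature.
df["fixed acidity"] = df["fixed acidity"].fillna(df["fixed acidity"].median())
```

The median is usually preferred over the mean here because wine chemistry features tend to have skewed distributions with outliers.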

Feature Selection

To reduce the dimensionality of this dataset, I will use this correlation matrix to determine which features highly correlate with one another. Some of the features in the dataset serve the same purpose in winemaking, as shown in the data description table. Correlations between features that serve the same purpose for winemakers require special consideration because those features are the most likely to correlate.
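A correlation matrix of this kind can be computed directly with pandas. The sketch below uses synthetic stand-in data (assumed column names, with a built-in inverse relation between fixed acidity and pH) rather than the real CSV:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for a few wine features (column names assumed).
fixed_acidity = rng.normal(8.3, 1.7, 200)
df = pd.DataFrame({
    "fixed acidity": fixed_acidity,
    # pH constructed to correlate negatively with fixed acidity.
    "pH": 3.3 - 0.1 * fixed_acidity + rng.normal(0, 0.05, 200),
    "alcohol": rng.normal(10.4, 1.0, 200),
})

# Pairwise Pearson correlations between the features.
corr = df.corr()
print(corr.round(2))
```

On the real data, `df.drop(columns="quality").corr()` would give the feature-only matrix, which can then be rendered as a heatmap.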

Density, fixed acidity and citric acid show a strong positive correlation. Total sulfur dioxide and free sulfur dioxide also show a strong positive correlation. Fixed acidity and pH show a strong negative correlation.

Total sulfur dioxide and free sulfur dioxide are used to kill unwanted bacteria and preserve wine. Keeping both features is redundant so it is safe to remove one. Since total sulfur dioxide explains how much free sulfur dioxide is in the wine and how much is bound to other chemicals, it is best to remove the free sulfur dioxide feature from the dataset.

Although density, fixed acidity, and citric acid show strong correlations, they each explain a unique aspect of the wine, so these features will stay.

pH and fixed acidity inversely correlate, but they both measure acidity. The dynamic range of pH values is 2.74 - 4.01, and any solution with a pH below 7 is considered acidic. This shows that every wine sample is acidic, but not much else. Since fixed acidity is a measure of titratable acids and gives an estimate of the total concentration of acid in each sample, it is the better measure. Thus it is best to remove pH because it does not do a great job of explaining acidity.
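Dropping the two redundant features is a one-liner in pandas. A minimal sketch, using a hypothetical frame with the relevant columns:

```python
import pandas as pd

# Hypothetical frame containing the two columns slated for removal.
df = pd.DataFrame({
    "fixed acidity": [7.4, 7.8],
    "free sulfur dioxide": [11.0, 25.0],
    "total sulfur dioxide": [34.0, 67.0],
    "pH": [3.51, 3.20],
})

# Remove the redundant features identified above.
df = df.drop(columns=["free sulfur dioxide", "pH"])
print(list(df.columns))
```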

Class Discretization

The quality distribution plot shows a bimodal distribution because there are two peaks, at qualities of 5 and 6. This means that a random wine sample has the highest probability of being classified as a 5 or 6. This information helps discretize the dataset further by dividing the quality values into 3 ranges. Seeing that many wine samples have a quality of 5 or 6, this range can be set as the middle range; values greater than 6 can be the upper range and values less than 5 can be the lower range. The wine samples will be divided into three classes: low (quality below 5), medium (quality 5-6), and high (quality above 6).
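This binning can be done with `pd.cut`. A small sketch on a hypothetical handful of quality scores:

```python
import pandas as pd

# Hypothetical sample of quality scores.
quality = pd.Series([3, 4, 5, 5, 6, 6, 7, 8])

# Bin the scores into the three ranges described above:
# (-inf, 4] -> low, (4, 6] -> medium, (6, inf) -> high.
labels = pd.cut(
    quality,
    bins=[-float("inf"), 4, 6, float("inf")],
    labels=["low", "medium", "high"],
)
print(labels.value_counts())
```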

Training and Testing

Cross Validation

The wine quality dataset has 1599 instances, so it is a medium-sized dataset. If the dataset had about 100,000 instances, it would be considered large, and if it had only 100 instances, it would be considered small. The goal in splitting the data should be to leave enough data for training to achieve good generalization when classifying new data, so the number of instances is important when deciding how the data should split. With an 80/20 split, there are more than enough instances to train the classifier and not too few instances to test its accuracy: about 1279 instances are used for training, which is still a medium-sized dataset, and the classifier classifies about 320 instances.
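The split itself can be done with scikit-learn's `train_test_split`. The sketch below uses dummy features and labels standing in for the 1599 samples; stratifying on the class label keeps the class proportions similar in both sets, which matters given the imbalanced quality distribution:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy stand-in for the 1599 wine samples (9 features after selection).
X = np.random.rand(1599, 9)
y = np.random.randint(0, 3, 1599)  # low / medium / high classes

# 80/20 split, stratified so each class keeps its proportion in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
print(len(X_train), len(X_test))
```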

Modeling Logistic Regression

Multiclass Logistic Regression (One vs all)
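The custom implementation is not reproduced in this write-up, but the one-vs-all idea can be sketched in a few lines of NumPy: one binary logistic regression is fit per class, and prediction picks the class whose binary classifier is most confident. This sketch uses plain batch gradient descent and hypothetical function names, not the actual optimizers compared later:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_ovr(X, y, n_classes, lr=0.1, epochs=2000):
    """Fit one binary logistic regression per class (one vs. all)."""
    n, d = X.shape
    W = np.zeros((n_classes, d))
    for c in range(n_classes):
        target = (y == c).astype(float)  # current class vs. the rest
        for _ in range(epochs):
            p = sigmoid(X @ W[c])
            W[c] -= lr * X.T @ (p - target) / n  # batch gradient step
    return W

def predict(X, W):
    # Choose the class whose binary classifier is most confident.
    return np.argmax(sigmoid(X @ W.T), axis=1)

# Tiny demo on three well-separated 2-D clusters (bias column appended).
X = np.array([[0, 0], [0.5, 0], [5, 0], [5, 0.5], [0, 5], [0.5, 5]], float)
X = np.hstack([X, np.ones((len(X), 1))])
y = np.array([0, 0, 1, 1, 2, 2])
W = train_ovr(X, y, n_classes=3)
```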

Visualizing Logistic Regression

The best performing classifier is the steepest-descent method using elastic-net regularization. The penalty value for C1 and C2 is 0.65, and this value was randomly selected from a dynamic range between 0 and 1. The graph measures the accuracies of each optimization technique side by side; the color pattern distinguishes between the regularization methods, and the randomly selected penalties are displayed in the hover text box.

This method avoids data snooping by selecting the penalties at random. The method does not take the accuracy score from a previous iteration of an optimization technique and use that knowledge to find optimal hyperparameters. Since the hyperparameter values are chosen at random, they cannot be used to overfit the regression to the test set.
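The random selection amounts to drawing each penalty weight uniformly from [0, 1) with no feedback from earlier accuracy scores. A minimal sketch (variable names are assumptions, not the actual implementation):

```python
import numpy as np

rng = np.random.default_rng()

# Draw the elastic-net penalty weights uniformly at random from [0, 1),
# independent of any previously observed accuracy (no data snooping).
c1 = rng.uniform(0.0, 1.0)  # L1 penalty weight
c2 = rng.uniform(0.0, 1.0)  # L2 penalty weight
print(c1, c2)
```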

Comparing to Scikit-learn

The custom method trains and classifies the test set much more quickly than the scikit-learn method. The accuracy scores are also higher with the custom method, though only marginally. The custom model uses elastic-net regularization to introduce bias when fitting the training data, which reduces variance when predicting on the testing data; the result is greater accuracy for subsequent predictions in the long run. Elastic-net regularization combines L1 and L2 regularization, so features that do not add much value to the prediction are removed and the magnitudes of the important features never become too large. Elastic-net regularization produced the best accuracy scores for both steepest descent and Newton's method, so this regularization technique appears to have contributed substantially to that performance.
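For reference, the scikit-learn side of such a comparison fits elastic net via the `saga` solver. The sketch below uses synthetic data in place of the wine features, and an arbitrary `l1_ratio` of 0.5 rather than the randomly drawn penalties above:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic 3-class data standing in for the binned wine samples.
X, y = make_classification(n_samples=400, n_classes=3, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Elastic net = L1/L2 mix; only saga supports it in LogisticRegression.
clf = LogisticRegression(penalty="elasticnet", solver="saga",
                         l1_ratio=0.5, max_iter=5000)
clf.fit(X_tr, y_tr)
print(round(clf.score(X_te, y_te), 3))
```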

Deployment

The implementation that should be deployed is the scikit-learn implementation. Although the custom implementation is faster, the gains in accuracy are not enough to justify deploying it over scikit-learn's logistic regression model. The scikit-learn implementation is more user friendly because there is only one required parameter, so the user will not have to spend time adjusting too many hyperparameters. This is important because the primary users of this model will be wine experts, whose time is better spent simply executing the model to explore wine rather than tuning hyperparameters. Since this model is meant for offline analysis, execution time is not as important. This model will not be put in a high-risk application like a self-driving car, so speed matters less given that the application is to further the study of wine.